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The notion of probability density for a random function is not 
as straightforward as in finite-dimensional cases. While a probability 
density function generally does not exist for functional data, we show 
that it is possible to develop the notion of density when functional 
data are considered in the space determined by the eigenfunctions of 
principal component analysis. This leads to a transparent and mean- 
ingful surrogate for density defined in terms of the average value of 
the logarithms of the densities of the distributions of principal compo- 
nents for a given dimension. This density approximation is estimable 
readily from data. It accurately represents, in a monotone way, key 
features of small-ball approximations to density. Our results on es- 
timators of the densities of principal component scores are also of 
independent interest; they reveal interesting shape difi'erences that 
have not previously been considered. The statistical implications of 
these results and properties are identified and discussed, and practical 
ramifications are illustrated in numerical work. 



1. Introduction. The concept of probability density for a random func- 
tion is becoming increasingly important in functional data analysis. For 
example, it underpins discussion of the mode of the distribution of a ran- 
dom function, addressed in particular by Gasser, Hall and Presnell (1998), 
Hall and Heckman (2002) and Dabo-Niang, Ferraty and Vieu (2004a, 2004b, 
2006). Nonpar ametric or structure- free methods for curve estimation from 
functional data involve the concept of density, not least because they gen- 
erally are based on estimators of Nadaraya-Watson type which require di- 
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vision by an estimator of a small-ball probability. See, for example, Ferraty, 
Goia and Vieu (2002a, 2002b, 2007a, 2007b), Ferraty and Vieu (2002, 2003, 
2004, 2006a, 2006b) and Niang (2002). There is of course a more general 
and very large methodology for functional data analysis, accessible via the 
monographs by Ramsay and Silverman (2002, 2005). 

In this paper we take up directly the notions of the density and mode 
of the distribution of a random function. We argue that, while a density 
function is generally not well defined in this context, it is possible to define 
a meaningful concept of density for a specific scale or resolution level which 
is intrinsically linked to a particular dimension in a principal component 
representation. The challenge is to determine the dimension. We give an 
argument which leads directly from scale to dimension, through a simple 
approximation to a small-ball probability at a given scale. 

The density approximation suggests a simple and appealing definition 
of mode and leads directly to an empirical approximation to density for a 
given dimension. Likewise, the approximation also enables two functions to 
be compared on the basis of their "relative likelihoods," that is, the heights 
of the density at the respective functions; although we shall not explore that 
feature in the present paper. We develop theoretical arguments describing 
both the approximation and the estimation of principal component score 
densities, and we give numerical illustrations of our conclusions. 

Our empirical methods involve estimating the densities of principal com- 
ponent scores using approximations to those scores based on estimators of 
eigenvalues and eigenfunctions. This problem is itself of intrinsic interest, not 
least because principal component score densities reveal interesting shape 
differences. Our theoretical results describe properties of density estimators 
in this context. 

The problem of determining the intrinsic dimension of the distribution 
of a random function for given scale is related to that of estimating the 
effective dimension of a sample of p- vectors when the sample size, n, is 
much less than p. Indeed, the connection between very high-dimensional 
data problems, and problems involving functional data, is drawn explicitly 
by Leng and Miiller's (2006) "stringing" method which permits a random 
function to be computed from a long data vector. Leng and Miiller suggest 
that the effective dimension of the transformed data be computed using 
principal component methods which also underpin our analysis. 

This paper is organised as follows. In Section 2 we define a notion of 
density (the "log-density") that can be used in the functional data context, 
and from there we define a modal function which can be used to measure 
central tendency. The log-density depends on the densities of the principal 
component scores, and in Section 3 we show how to estimate these densities 
and study theoretical properties of these estimators. In Section 4 we provide 
theoretical arguments that justify the use of log-density in the functional 
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data context. In Section 5 we use our estimators of the surrogate density, 
of the densities of the principal components scores and of the modal curve, 
to analyse an Australian rainfall dataset. We also illustrate the methods on 
some simulated data. The proofs of the main results are gathered in Section 
6. 

2. Main results and their implications. 

2.1. Decomposition into principal components. As a prelude to sum- 
marising our main results we briefly revise important properties of functional 
principal component decomposition. See Besse and Ramsay (1986), Ramsay 
and Dalzell (1991), Rice and Silverman (1991) and Silverman (1995, 1996) 
for early work on this topic. Let X be a random function supported on a 
compact interval Z. If the covariance function of X is positive definite, it 
admits the following spectral decomposition: 

oo 

(2.1) K{s,t) = coY{X{s),X{t)} = ^%V',(s)Vi(i), 

i=i 

where the expansion converges in L2 on , and 9i>92>--- are the eigen- 
values, with respective orthonormal eigenvectors V'j, of the linear operator 
with kernel K. 

The functions ipi,'tp2, - ■ ■ form a basis for the space of all square- integrable 
functions on I, and, in particular we can write, for X and any square- 
integrable function x on I, 

00 00 
X = Y,Oy'x,i;„ x = Y,Q'I\,^,, 
i=i i=i 

where the quantities Xj = 0- fj-Xipj and xj = 6^ /j^V'j ^"^^ prin- 
cipal component scores (sometimes referred to as the principal components) 
corresponding to functions X and x. If E{X) = then the above decompo- 
sition of X is termed the Karhunen-Loeve expansion, or generalised Fourier 
expansion. The Xj^s are always uncorrelated (this follows from orthogonality 
of the V'i's), and we shall assume that they are independent. This is exactly 
correct if X is a Gaussian process, and it is almost always assumed to be the 
case in empirical or numerical work. In such cases, as here, independence is 
often interpreted pragmatically; it captures the main features of a popula- 
tion, allows relatively penetrating theoretical analysis and motivates simple, 
practical methodology such as the estimators explored in Section 5 below. 
Particularly in the infinite-dimensional setting of functional data analysis, 
it seems impossible to use effectively general models for random variables 
that are uncorrelated but not independent. Such an approach leads to cum- 
bersome methods and does not seem to allow useful insight into theoretical 
properties. 
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2.2. Log- density function. Many descriptive and predictive functional 
data analyses lean heavily on properties of the space of principal compo- 
nent scores. For example, it is common to describe properties of a sam- 
ple of curves in terms of the shapes of the eigenfunctions corresponding to 
the largest eigenvalues obtained by functional principal component analysis. 
Similarly, it seems natural to define density for functional data in terms of 
the densities fj of the principal component scores Xj, for example, via the 
product of the densities fj corresponding to the largest eigenvalues. The 
notion of log-density, which we shall define below, fills this role. 

For h > 0, let p{x \ h) = P{\\X — x\\ < h) where ||X — x|| denotes the L2 
distance between X and x. We shall show in Section 4 that 

r 

(2.2) logp{x I h) = Ci{r,9) + log /^(x,) + o(r), 

where r = r{h) diverges to infinity as h decreases to zero, fj is the density 
of the jth principal component score, Xj is the version of that score for 
the function x and both r and the constant Ci depend on h and on the 
infinite eigenvalue sequence, say. (Neither r nor Ci depends on x or on the 
distributions of the principal component scores.) The term J2j<r^^S fji^j) 
in (2.2) typically diverges at rate r as r is increased, and in particular it is 
generally not equal to o(r). 

Result (2.2) implies that, for appropriate eigenvalue sequences, 

the log-density i{x \ r) = i"~^Ylj<r^°Sfji^j) captures the variation, 
(2 3) ^^^^ °^ logarithm of the small-ball probability p{x \ h), up 
to and including terms of order r and in particular gives rise to a 
remainder of strictly smaller order than r. 

[We have divided by r only to ensure that i(x \ r) remains bounded as r 
increases which makes it easier to discuss in theoretical terms. Of course, 
division by r does not alter the main features of L] More explicitly, we shall 
prove in Section 4 that 

(2.4) p{x I h) = C2(r, 9) exp{r^(a; | r) + o(r)}, 

where C2(r, 6) = (/i7r^/^)'"r(^r + 1)~^ nj<r ^^"^ does not depend on x. That 
is, 

(2.5) P{x I h) = ^|^^|ri^7'/V.(x,)| exp{o(r)}. 

One implication of the approximation at (2.4) is that it allows us to extract 
a function of x, namely the log-density £(x | r), which captures the first-order 
effect that x has on p{x \h). In other words, the log-density describes the 
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main differences in sizes of small-ball probabilities for different values of x. 
Moreover, up to terms that are negligible relative to those captured by the 
log-density \ogp{x \ h), i(x | r) is a monotone increasing function oi p{x \ h). 
Therefore, while i{x \ r) cannot, in general, be employed to compare densi- 
ties for different random function distributions, it can be used as the basis 
for comparing density at different points x for the same random function 
distribution and for dimension r. Another implication of (2.4), which we 
shall derive in the Appendix, is that a probability density function for X 
does not exist. 

The principal component scores Xj and xj are related to the squared L2 
distance between X and x by the formula 



Although the xj 's are obviously linked to the eigenvalues 0j , in terms of the 
way these two quantities influence the distance X — x from zero [see (2.6)] , 
it can be seen from (2.5) that the Oj^s and Xj's are largely disconnected in 
terms of the way they influence the probability that X — x is only a small 
distance from zero. In particular, we could arbitrarily permute the densities 
fi, ■ ■ ■ , fr without influencing anything other than the o(r) term on the right- 
hand side of (2.5). Additionally, (2.4) makes it clear that the densities of 
the principal component scores are important only through their aggregate, 
defined by £{x \ r) in (2.3), and are of relatively little individual relevance. 

2.3. Defining the mode. The log-density can be used to define a notion 
of central tendency in a population of curves. In the literature, central ten- 
dency is sometimes measured by the mean function. While this quantity is 
very easy to calculate, and it is close to its analogous definition in finite- 
dimensional settings, it is well known that it is generally unsatisfactory as 
a measure of "average" in the context of functional data since it tends to 
average out most of the fiuctuations. For example, when applied to non- 
registered data or to data which cannot be perfectly aligned the averaging 
process often results in a mean curve that does not share typical properties 
(such as oscillations) of the population of curves. 

As an alternative, we propose representing central tendency by the "modal 
function," 



00 



(2.6) 
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which, for each j, has the jth principal component score xj equal to the 
mode rrij of fj. In a finite sample, Xmodc can be estimated by 

T 



where ipj and 6j are estimators of ipj, and 9j, rhj is the mode of fj, fj an 
estimator of fj, and T is a truncation point which grows with the sample 
size. 

Of course, in the case of functional data as well as multivariate data, we 
could also use the median to measure central tendency. In the context of a 
random function X, the median curve can be defined (analogously to the 
spatial median) to be the function x that minimizes — x|| (note that 

the theoretical mean minimizes E'HX — In practice, this median func- 
tion can be estimated from the data by an iterative algorithm described in 
Gervini (2008). It can be shown through experimentation that the median is 
not as susceptible to the problem of averaging fluctuations as the mean but 
often more susceptible than the mode. For instance, in problems where the 
population consists of two or more well-separated sub-populations, the mean 
and the median can represent a function that is central in a strict mathe- 
matical sense but not representative of any function in any sub-population; 
whereas the modal function will often represent the most likely function in 
one of the sub-populations, and therefore will be less abstract and more 
interpretable than the mean or the mode. Nevertheless, in important cases 
(e.g., when X is a Gaussian process) the mean, median and mode are iden- 
tical. 

3. Estimation of density of principal components, and log-density esti- 
mation. 

3.1. Empirical estimation of the density of principal component scores. 
In this section we show how to estimate the densities fj of the principal 
component scores. This result will be used to provide an estimator of the 
log-density i, but it is of intrinsic interest, since having access to the densities 
fj can also be very useful for descriptive analysis of functional data. The 
/j's contain indeed valuable additional information about the structure of 
the population, compared to just the 9j^s and ipj^s. 

Starting from independent data ^[i], . . . ,X[n] on X, compute 



(3.1) K{s,t) = - Y.{X[^{s) - X{s)}{X[^{t) - X{t)} = Y,Oj^jis)^Pjit), 





n 



oo 
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where X = n ^ X^j-^^fj], the expansion in (3.1) is the empirical analogue of 
that at (2.1) and the terms are ordered so that > ^2 ^ ■ ■ ■ • Thus we are 
centring the data at the sample mean rather than at the true mean which is 
of course unknown. We interpret 6j and ipj as estimators of the eigenvalues 
9j and eigenfunctions V'j, respectively. (We use square- bracketed subscripts 
so as not to confuse the ith data value Xjjj with the zth principal component 
score, Xj, of X.) See, for example, Ramsay and Silverman [(2005), Chapters 
8-10]. Then we calculate approximations = 9- /^(-'^[i] — to the 
principal components = 9j /i(-^[j] ~ EX^^)'ipj. We define too xj = 
9 J ^^"^ jj-{x — X)'ipj, an estimator of xj = 9- ^^'^ fj-{x — EX)ipj. 

— 1/2 

An estimator fj of the probability density function fj of 9^ {Xj — EXj ) 
can be computed using the following standard kernel methods: 

i=i ^ ^ 

where h denotes a bandwidth and is a kernel function. For an introduction 
to kernel density estimation [see, for example, Silverman (1986) and Wand 
and Jones (1995)] . The value of h could be chosen using standard methods for 
random data, reflecting the fact that -'^[jj] , for 1 < i < ?i is an approximation 
to the independent sequence X[ij], \ <i<n. 

Provided the jth eigenvalue 9j is not equal to either 9j-\ or the 
estimators 9j and V'j are root-n consistent for 9j and il)^ (modulo a change 
of sign of V'j), respectively, and so Xj = Xj + Op(n~^/^). Note too that X 
cancels from the numerator inside the kernel in the definition of fj{xj), and 
in fact. 

In Section 3.2 below we show that this estimator is first-order equivalent to 
its "ideal" counterpart, 

which we would use if we knew 9j and V'j- Properties of fj{xj), as an esti- 
mator of fj{xj), can be worked out using standard arguments. In particular, 
fj{xj) has variance and bias asymptotic to wfj{xj)/{nh) and ^W2fj {xj)h? , 
respectively, where w = J W"^ and W2 = J v?W{u) du. 
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Our estimator of the log-density i{x | r), in (2.3), is given by 

1 

(3.5) i{x\r) = -Y,^ogfj{xj). 

^ i=i 

An attractive feature of i{x \ r) is the ease with which it can be computed 
for a range of values of r. 

3.2. Theoretical properties. Here we show that the estimators at (3.3) 
and (3.4) are uniformly first-order equivalent. Since the variance and bias of 
fj{xj) are generally of exact orders /i^ and {nh)~^, respectively, then first- 
order equivalence is attained if fj — fj = Op{(n/i)~^/^ + h'^}. Result (3.10) 
below is a strong form of this property. 

The conditions we impose are the following: 

for each C > and some 5 > 0, sup^gj -E{|X(t)p} < oo and 
sup,^t^i,,^tE[{\s-t\-^\Xis)-Xit)\}^]<oo; 

for each integer r > 1, O'^^ E{Jj-{X — EX)ipi^}'^''' is bounded uniformly 
in k; 

(3.8) there are no ties among the j + 1 largest eigenvalues; 

the density fj of the jth principal component score is bounded and 
has a bounded derivative; the kernel is a symmetric, compactly 

(3.9) supported probability density with two bounded derivatives; for some 
S > 0, h = h{n) = 0{n~^) and n}~^li^ is bounded away from zero as 
n — )• oo. 

Note that the assumptions on h in (3.9) permit a bandwidth of conven- 
tional size, that is, /i ~ const. for any positive constant. The theorem 
remains valid if W is the standard normal density, but more generally, in- 
finitely supported kernels require assumptions about the rate at which their 
tails decrease. The use of infinitely supported kernels would alleviate difficul- 
ties that might arise when calculating log /j(n). However, the main features 
of the log-density i (e.g., its modes) are identical to those of its exponenti- 
ated form. Therefore, in practice, to describe the main properties of a sample 
of curves it is not necessary to calculate logarithms, and we can simply work 
with the product of the estimated densities fj. 

Given a square-integrable function x defined on X, put = Jj-x{t)'^ dt. 
For each c > 0, let 5(c) denote the set of x such that ||x|| < c. Recall that 
fj{xj) and fj{xj) can be interpreted as functionals of x, and are defined by 
(3.3) and (3.4), respectively. 



(3.6) 
(3.7) 
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Theorem 3.1. // (3.6)-(3.9) hold then, for all c> 



(3.10) sup - fjix,)\ = o{K)-i/2}. 



For the sake of brevity the proof of Theorem 3.1 is omitted. It can be 
found in a longer version of this paper [Delaigle and Hall (2008)]. 

3.3. Joint density estimation. Part of the simplicity of the log-density 
estimator, at (3.5), is that it involves only the marginal principal component 
score density estimators, fj, and not estimators of the joint densities of those 
scores. Of course, this is a consequence of the assumption that the scores are 
independent, but without that assumption the working statistician would be 
faced with not only a substantially more complicated density approximation, 
but the need to estimate joint densities. The latter problem is itself very 
challenging, unless sample size is large, since the accuracy of nonparametric 
density estimators decreases rapidly as dimension increases. Therefore the 
estimators that are produced under the assumption of independence enjoy a 
simplicity, that is, in many cases, a prerequisite for practical implementation. 

4. Theoretical studies leading to results (2.3) (2.5). 

4.1. Assumptions. Let X and x be, respectively, a random and a fixed 
function on I, and let Xi,X2, . . ■ and xi,X2, ... be their scores, defined in 
Section 2.1. For simplicity we assume that E{X) = 0, but if this condition 
does not hold then the mean of X can be incorporated into Xj by adding 
a term E{Xj). For each j, let fj be the density of Xj, and note that, by 
definition of the scores, the Xj^s are uncorrelated and have mean zero and 
variance 1; in this work we assume that they are independent. See the last 
sentence of Section 2.1 for discussion of this condition. 

For j = 1, 2, . . . , let Wj = Xj — Xj and let gj denote the probability density 
of Wj. Thus, Wi, W2, ■ ■ ■ are independent random variables and, for all real 
w, gj{w) = fj{w + Xj). For a given sequence of x/s we assume that 



and that the sequence 9i > 82 > ■ ■ ■ of eigenvalues associated with the co- 
variance of the function X are positive numbers such that 



(4.1) 



sup 00 



00 



(4.2) 




i=i 



Note that, by (4.1) and (4.2), the series 0jWf 
1. 



converges with probability 
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Suppose too that each gj is difFerentiable at the origin, with gj{0) / 0, and 
define pj = gj{0)/gj{0). We shall assume that the densities gj admit Taylor 
expansions about the origin. In particular, we ask that, for each A > 0, there 
exist a finite constant ^(A) > such that 

(4.3) sup|/9j|<oo, sup sup w~'^\gj{wj)gj{0)~^ — (1 + pjWj)\ < A{X). 
i>i i>ikjl<A 

To understand the type of conditions this requires of the /j's and Xj's, 
assume that the densities fj all have two bounded derivatives, and that for 
each A > 0, 

(4.4) inf inf /,(n)>0, sup {|/j(^z)| + |/;(^x)|} < oo. 

J>1|«I<A |n|<A 

For example, (4.4) holds if the principal components of X are identically 
distributed with a density that has a bounded second derivative and does 
not vanish on the real line. Then (4.3) holds with gj{w) = fj{w + Xj) for any 
bounded sequence of real numbers Xj. The case of an unbounded sequence 
Xj can also be treated, but rather than the more general conditions imposed 
above, it requires assumptions and arguments that are related to specific 
density types. Therefore we shall not develop that case here. 

4.2. Approximation of the small ball probability. Our first result, Theo- 
rem 4.1 below, will underpin our approximations to the value of 

(4.5) p{x\h)=p{h) = p(^^djWf <h^^ 
as hlO. [To derive (4.5) we used (2.6).] 

1/2 

Given h and A satisfying < h< XO^ , we shall suppose that r = r{h) > 1 
has been chosen such that 

(4.6) e-^h^ < A^. 

Define S = Y^j>r+i ^j^f^ ^^^d let G = G{- \ h, r) denote the distribution 
function of S. Our first approximation to p{h), at (4.5), is given by q{h), 
described in the following theorem. The proof of the theorem is given in 
Section 6. 

Theorem 4.1. Assume (4-l)''(4-3), and that r is chosen so that (4-6) 
holds. Then 

(4.7) p{h) = e^p{uj{h,X)X^}q{h), 
where 

(4.8) g{h) = ^|^|fl^7^/V,(x,)| j\l-tY'UG{tl 
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\(jj{h,X)\ < B{X) and the function B{X) > is nondecreasing in A and does 
not depend on h or r. 

Next we apply Theorem 4.1 to develop more specific approximations to 
p{h) for small h. Our results depend on the rate of convergence of the se- 
quence 6j to zero, and we consider two cases. We shall say that a sequence 
is "superexponential" if 

(4.9) 6k+i/6k^0 as/c-^oo, 

or equivalently, if 9j^^ ^j>k+i^j ~^ More generally, we shall say that the 
sequence is "exponential ' if 

oo 

(4.10) 6*^^ ^ 6'j is bounded as/c— ^cx). 
j=k+i 



When the eigenvalues converge to zero at a slower rate, nonparametric meth- 
ods, where the notion of a functional-data density is typically employed, have 
much lower performance and so are less attractive and less likely to be used. 

We also need to define the effective dimension, r = r{h), for a given value 
of scale, h. In the superexponential setting, if for some s the value of fP' jQs is 
"sufficiently close to 1" then we should take r = s, but otherwise we should 
take r to be the unique integer for which 6j.+i < <9r- More specifically, 

there exists a sequence of positive constants ci,C2,..., depending 
on the eigenvalue sequence 61,62, - ■ ■ and diverging to infinity, such 

(4.11) that, if (for a given h) there exists s > 1 such that \log{h'^ /6s)\ < Cg, 
then we take r = r{h) to be the infimum of such values; and if no 
such s exists then we take r to be the value for which 6r+i <h? < 6r- 

In the case of an exponential sequence we define 

(4.12) r = r(/i. A) = argmax{j: 61-^/1^ < A^}, 

and, for j = 1, 2, we let 5j{s, A) denote a quantity which satisfies 

(4.13) lim limsup(5j(s. A) = 0. 

A^oo s_>oo 

Put Qj = log dj and c/ij = log fj {xj ) . The proof of the next theorem is 
given in Section 6. The first part of (4.14) below is identical to (2.5). 

Theorem 4.2. Assume (4.1)-(4-3). In the superexponential case, for r 
as at (4-11), 



(/i7rV2 



r 



pW = f^^}^exp{o(r)}|n^7'^V,(x,) 



(4.14) 



exp 



1 

-r{log(2^e/i2) - log r + o(l)} + ^(9,- + • 
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as /i — 7- 0, and when 6j is exponential, for r as at (4-12), 
P^^^ = r(t/2?i) exp{rJi(r, A)}|ng-^/V,(x,) 

(4.15) 



exp 



-r{log(27re/i2) - logr + <52(r, A)} + ^(9^- + 



where 5i{r,X) and 62{r,X) satisfy (4-13). 

Theorem 4.2 shows that, for appropriate eigenvalue sequences, the approx- 
imations given at (2.4) and (2.5) hold. These approximations are appropri- 
ate when the eigenvalue sequence 6j decreases to zero at an exponential rate 
which is the most important case from a practical viewpoint. 

4.3. Other implications of the theorems. The integer r = r{h) represents 
the dimension in which we make an approximation to the small-ball prob- 
ability p{x I h) at scale, or resolution level, h. In particular, (2.3) links the 
notion of density to dimension rather than, as is more commonly the case, to 
small-ball radius. In theoretical terms the connection is expressed through 
simple formulae such as (2.4) or (2.5). From an empirical viewpoint, (2.3) 
suggests that, rather than attempt to estimate small-ball probabilities for 
different values of /i, so as to get a good idea of the way in which the notion 
of density changes as scale becomes finer, we can instead estimate the values 
of l{x I r) for different values of r. 

Note that l{x \ r) can be interpreted for increasing finite values of r, but 
l{x I oo), which we might define by taking the limit as r — )• oo in (2.3), does 
not necessarily exist. In particular, we could change any finite number of 
the densities fj without altering the definition of density on an infinitesimal 
scale. Therefore, density on an infinitesimal scale, that is, as /i —t- or as 
r — )• oo, is not identifiable unless we assume a model which asserts sufficiently 
close connections between early densities and principal component scores, 
and later ones. 

An example where li{x \ oo) is often well-defined arises when x is taken to 
be the modal function Xmode at (2-7). In that case, in order for the value of 
the log-density ^{x \ r) to be well defined as r — )• oo, it is necessary only that 
r~^ "^j^og fj{mj) converge. In particular, this condition is satisfied trivially 
if all the distributions of principal component scores are identical. 

More generally, the value of r can be interpreted as the dimension of the 
scale space when the unit of scale is h. The need to take scale, or resolu- 
tion level, into account when discussing the density of a random function 
reflects the importance of scale in other settings. For example, Chaudhuri 
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and Marron (1999, 2000) and Godtliebsen, Chaudhuri and Marron (2002) 
took a scale-space view of function estimation, noting that the viewpoint 
was ah'eady commonly used in areas such as imaging. There, the authors 
argue that one can learn different properties of the population at each scale 
where larger scales explain the overall structure of the population, and finer 
scales help understand the finer structure. 

For a given scale h, the results of Theorem 4.1, and in particular (4.11) 
and (4.12), indicate that the effective dimension r satisfies 

(4.16) h^^Or. 

Property (4.16) implies that, if we are considering two distinct random func- 
tion distributions for which the respective eigenvalue sequences decrease at 
different rates, then, for a sufficiently small value of scale, h, the correspond- 
ing dimension, r, is greater in the case of the random function with the less 
rapidly decreasing eigenvalue sequence. Of course, this makes intuitive sense. 

One could determine the value of r empirically by using relatively con- 
ventional methods for dimension estimation. See, for example, Horn (1965), 
Velicer (1976), Zwick and Velicer (1986), Peres- Neto, Jackson and Somers 
(2005) and Hall and Vial (2006). Alternatively we could simply estimate 
i{x I r) for an increasing sequence of values of r, accessing in this way in- 
formation about how density changes as we increase dimension and learn- 
ing different properties of the population for each r. Theoretical properties 
of empirical principal components are discussed by, for example. Hall and 
Hosseini-Nasab (2006, 2009). 

5. Numerical studies. 

5.1. Australian rainfall data. We applied our density and mode 
estimation methods to an Australian rainfall dataset, available at 
http : //dss . ucar . edu/datasets/ds482 . 1 and analysed by Lavery, Kariko 
and Nicholls (1992). The data consist of daily rainfall measurements be- 
tween January 1840 and December 1990, at each of 191 Australian weather 
stations. The function X{t) represents the rainfall at time t where t denotes 
the period that has passed, in a given year, at the time of measurement. 
Rainfall at time t was averaged over the years for which the station had 
been operating with the aid of a local polynomial smoother passed through 
discrete observations. One weather station (the 190th station) was removed 
from the collection of 191 since its rainfall pattern was very different from 
those for all other stations. Figure 1 depicts the yearly rainfall curves of 
the remaining 190 stations. On the left we show those stations (usually lo- 
cated in the north) which exhibit a "tropical" pattern, that is, those where 
most rain fell in mid to late summer; and on the right we show the stations 
(usually in the south) where the majority of rain came in cooler months. 
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Fig. 1. Australian rainfall data at weather stations which get the most rainfall during 
the summer months (left) or during the winter months (right). 

Table 1 indicates that the values of 6j decrease very quickly, and so it 
seems reasonable to use only the first few principal components to describe 
the data. In Figure 2 we plot the first four estimated principal component ba- 
sis functions and the corresponding densities fj of the standardised principal 
components scores. One interpretation of the first two principal component 
functions is that they capture, respectively, two key features of the data — 
peak rainfalls around days 35 and 215 and peak rainfalls around days 30 and 
180. In each case, the first class of weather station generally corresponds to 
towns with a tropical or semi-tropical climate and the second to towns with 
a mid-latitude climate. Taken together, these two principal components cap- 
ture the dichotomy between the two main latitude-determined climate zones 
in Australia together with the subtler effect of rainfall peaks that occur sep- 
arately in either winter or summer, but where the peaks within either class 
can nevertheless differ by months. 

We have chosen the signs of the first two principal component functions 
so that the first has its minimum in late winter whereas the minimum of 
the second is in late summer. However, this feature could easily be altered 
by a sign change; the signs of principal component functions are not deter- 
mined. The densities of each of the first three principal component scores 
are skewed. Of course, the direction of skewness is tied to the sign of the 
principal component function which is arbitrary. The skewness is one aspect 
of the distinct non-Gaussian nature of the rainfall curves. The densities of 
higher-order principal component scores are less asymmetric; we plot only 
the fourth. 

Table 1 

Proportion of variance explained by the first j principal components, for j — 1, ... ,10 in 
the Australian rainfall data example 

j 123456789 10 



0.7380 0.9510 0.9811 0.9915 0.9957 0.9974 0.9984 0.9989 0.9992 0.9995 
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Fig. 2. Plots of the first four principal component basis functions (row 1) and the cor- 
responding estimates of densities of principal component scores, fj (row 2). In each row, 
the graphs from left to right are for decreasing values of 9j . 



Figure 3 shows the estimated mean and modal functions of the rainfah 
data where the modal function was calculated as at (2.8) and therefore de- 
pended on the number, T, of components used. Of course, since Xmode at 
(2.8) estimates the mode of the centered data, we have added the mean func- 
tion to each modal curve. (The mean function is represented by the heavy, 
unbroken curve in Figure 3.) It is clear from that figure that changing T 
from 1 to 5 alters the modal function significantly, but the effect of changing 
T from 6 to higher values is almost indistinguishable by eye. The mean curve 




Fig. 3. Curves representing the mean and mode, respectively, of Australian rainfall data, 
using T = 1,2,3,4 or 5 (first panel) or T — 6,7,8,9 or 10 (second panel). The annotation 
"mode j" indicates imodo at (2.8) in the case T — j. On the right panel we also show the 
median curve. 
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Fig. 4. Contour plot of the estimated surface 2exp{£{x \ 2)} = /i(a;i)/2(a;2) for the 
Australian rainfall data. The pink crosses represent the values o/ /i(A'[ii])/2(X[i2]) for 
i = 1, . . . ,n, where Xyij^ is the jth centered and scaled principal component score of the ith 
data curve. 

appears to be strongly influenced by the few stations that have high rainfall, 
but the modal curve is noticeably more robust. The figure also shows the 
median curve which we calculated using the algorithm described in Gervini 
(2008). The median lies between the mean and the mode, a property which 
is known to hold for many univariate distributions [see Haldane (1942) and 
Hall (1980)] but has not been studied previously for functional data. In this 
example the median has similarities with the modal curve, but it is still a 
bit high due to the influence of the tropical weather stations, especially in 
the summer months. These features make the median curve less appealing 
then the modal curve which looks more typical of curves for a majority of 
towns with mid-latitude climates. 

Figure 4 shows a contour plot of 2exp{£(x | 2)} = fi{xi)f2{x2), that is, the 
estimated surrogate density for r = 2 together with the values /i {X^n] ) /2 (^[i2] ) > 
for i = 1, . . . ,n, represented by pink crosses. The colors range from blue for 
low density values to yellow for high density values. 

We cannot visualise the estimated density i{x \ r) at higher resolution 
levels (i.e., for r > 3) by showing a surface curve. To see the effect that 
increasing r has on £(x | r), we calculated this density for r = 1, . . . , 10 and 
for X = ^[1], . . . , X\^n] (i-e-i foi' each data curve), and then, for each r, classified 
the n data curves into several groups according to the value of ^(Xjjj | r), 
using the same color code as Figure 4, that is, using colors ranging from 
blue for the lowest values of £(• | r) to yellow for the largest values. We show 
in Figure 5 the groups of curves obtained for r = 2 and r = 10. We see that, 
overall, the curves of low (respectively, moderate or high) density for r = 2 
correspond to the curves of low (respectively, moderate or high) density for 
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Fig. 5. Plots of the 190 rain data curves. In each row, the furthest left graph shows 
all the curves and the other graphs show, from left to right, groups of curves for indices 
i that correspond to increasing values of £{X[i^ \ r). The top row depicts results when 
r = 2 whereas the bottom row corresponds to r = 10. For r = 2 we use the color code 
corresponding to Figure 4, o-nd for r = 10 we use a similar code corresponding to the 
estimator of £{X[i^ \ 10). 



r = 10. In other words, the density at resolution r = 2 already reflects the 
main features of the data. The blue curves roughly correspond to the stations 
for which rainfall varies the most over the year; these stations are very 
heterogeneous and thus have low density. At the other end of the spectrum, 
the yellow curves correspond to the stations with the flattest yearly rainfall; 
these stations are quite homogeneous and, logically, they have the highest 
density. The green curves correspond to a moderate rainfall change over the 
year and have moderate density values. 

5.2. Simulated examples. As discussed at the end of Section 2.1, in the 
setting of functional data analysis it can be quite difficult to undertake 
meaningful inference without the simplifying condition of independence of 
the principal component scores Xj. In particular, without that assump- 
tion, to represent the joint density of Xi,...,Xr we would need to re- 
place Y\i<j<r fji^j)^ ^ product of univariate functions, by a more complex 
r-variate function fi^...,r{xi, . . . ,Xr). However, the difficulty of estimating 
the latter increases rapidly with dimension. Therefore, unless samples sizes 
are particularly large, the quality of multivariate nonparametric estimators 
can be so poor that greater insight about the population is gained from 
estimators under the simplifying independence assumption. To illustrate 
this fact, we generated B = 500 samples of size n = 100 from the distri- 

1/2 

bution of X{t) = Yl i<j<io^j -^j'^Pji^) where tG [0,1], the Xj^s were un- 
correlated dependent random variables generated according to Xj = cTVj 
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Fig. 6. MSB of the estimators imodc and imodc for T =1, 2, 3 or 4- In the graphs we 
use multi to denote imodc- The graphs show, from left to right, the results of models (i) 
to (iv). 

where the V^'s were independent and identically distributed, T was a uni- 
form C/[l,2] variable, common to all j's and c = {var(TV^)}~^/^. We took 
ipj{t) = \/2cos(7rjt), and Vj and 6j were chosen from one of four models: 
(i) Vj ~ x'(8) - 8, Oj = (ii) Vj ~ x'(8) - 8, e, = (iii) Vj ~ N(0, 1), 
9j=j-^; (iv) Vj ~ N(0, 1), Oj = j'^ where "~" means "is distributed as." 

In each case we calculated the estimator of the modal function, Xmode- We 
compared rcmode at (2-8) where each fj was estimated by a univariate kernel 
density estimator with plug-in bandwidth using the functions kde and hpi 
in Duong, Wand and Chacon's R package ks with the estimator Xmodc = 

X^i<j<T where {rhi, . . . jItit) was the mode of a T-variate kernel 
density estimator with plug-in bandwidth, calculated using the functions 
kde and Hpi in the R package ks. 

In each case we calculated MSE(t) = B^^ Z^i<b<_B{yfe(0 ~ ^^modcCO}^ 
Vb = Xh^modc and y = Xfe.modc where the index b indicates that the estimator 
was calculated from the bth sample. In Figure 6 we present the MSEs of both 
estimators of Xy^odc for r = 1, 2, 3 and 4. The estimator £b,mode performed 
better than Xfo^mode in all cases. Indeed, the quality of Xb^mode deteriorated 
very quickly as T increased, due to greater bias and increased stochastic 
error; both these difficulties reflect the curse of dimensionality. For models 
(i) and (ii), the best results were obtained by Xb,mode with T = 3 or 4, whereas 
for models (iii) and (iv), the best results were for Xfe^modc with T = 1. This 
reflects the fact that for models (i) and (ii), truncating the sum at T < 10 
in the definition of Xmode 

introduced a systematic bias whereas in models 
(iii) and (iv), each rrij = and thus truncating the sum at T < 10 did not 
produce any bias. 

6. Technical arguments. 

6.1. Proof of Theorem 4.1. Define = Zj<r ^j^f, S = Ej>r+i 
ht = {l- tf/'^h and 

(6.1) p,{h) = P{Si<h'')= [ \f\gj{wAdwi---dwr. 
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Let G = G{- I h,r) denote the distribution function of S. In this notation, 

(6.2) p{h) = P{Si + h^S <h^)= [ P(Si<hl)dG{t)= [ pr{ht)dG{t). 

Jo Jo 

Property (4.3) imphes that 

r ( r ^ ( ^ / ^ \^ 

(6.3) ]lgj{wj) = l]Jgj{0)\e^plY.P^'^^ + oi^\w,n , 
j=i [j=i ) [j=i \j=i / J 

uniformly in wi, . . . ,Wr such that 

(6.4) sup \wj\ < A. 

i<i<r 

If (4.6) holds then so too does (6.4), provided Ylij<r^j'^'] — Therefore, 
by (6.1) and (6.3), 

Pr{ht) = h^'l TT5i(0) \ / exp< h'S^PjWj + h^a{w) > dw, 

[j=i J Jj:,<rS,v'^<^-t [ ,=1 J 

(6.5) 

where w = {wi, . . . ,Wr), the function a does not depend on t and, for a 
constant Ci > 0, 

r 

(6.6) \a{w)\ < Ci^\wj\^ 
uniformly in w for which 

r 

(6.7) Y.^jw]<l. 

i=i 

The constant Ci depends on A, of which it is a nondecreasing function. 
If (4.6) and (6.7) hold, then 

r r r 

(6.8) h'^ w] < \^6r w] <\^Y ^J^^i - 

j=i j=i j=i 

These properties, (6.5) and (6.6) imply that, for a constant C2 > 0, 

(6.9) h^\a{w)\ < C2X^ 

uniformly in lu for which (6.7) holds. Here, C2 = C2(A) is a nondecreasing 
function of A. 
Let 



(6.10) /= / dw = vr{i-ty/^f]e;^^^- 
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where 

denotes the content of the r-variate unit sphere. The second identity in 
(6.10) foUows from the fact that, in view of the first identity, I equals the 
content of the ellipsoid having the equation ^j^^OjVuj = 1 — t. 
Results (6.5) and (6.9) imply that 

(6.12) p,(/it) = /i^|f[5j(0)|exp{a;i(t|/i,A)A2}J, 

where, here and in (6.16) below, the function ojj satisfies 

(6.13) sup \ujj{t\h,\)\ <C3 

0<t<l 

uniformly in h and A such that (4.6) holds for some r > 1, the constant 
C3 > depends only on C2 and the upper bound in (4.3) to sup^ \pj\, and 

(6.14) ^ = /^_^ „^„,,,_,-P^"E''«j ''^-'^W 
with 

Ii= I ^PjWj dw. 

Odd- indexed terms have cancelled from (6.14) through symmetry. 

When calculating Ij using term-by-term expansion of the quantity within 
parentheses, only products of the form (pji ifji ) • • • (pj2i ^i2i ) j where each 
distinct index among ji,...,j2i appears an even number of times, make 
a nonzero contribution. Therefore, 



h^'h< / \h^Y'p]w] \ dw 

(6.15) 

<{p\f^ [ dw = {pXf'l, 

•^E,<.e,«>|<i-t 

where p = supj\pj\ and, since (4.6) and (6.7) hold, we used the bound 
at (6.8). Using the bound (6.15) in (6.14) we deduce that J = exp{u)2it \ 
/i,A)A^}/ where uj2 satisfies (6.13). This result and (6.12) imply that 

(6.16) pr{ht) = /i'^|n5,(0)| exp{L^3(i I h,X)X^}I. 

The theorem follows on combining (6.2), (6.10), (6.11), (6.13) and (6.16). 
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6.2. Proof of Theorem 4-2. To derive (4.14) we treat two complementary 
cases which, if we consider convergence of h to zero along subsequences, 
cover all instances: (a) there exists a sequence of integers r = r(/i) diverging 
to infinity such that, as h converges along a subsequence, h'^ x 9^ (that is, 
h?' /Or is bounded away from zero and infinity as /i — t- 0); and (b) along a 
subsequence, and for r = r{h) diverging to infinity, h? /Or — )• and h? /6r+i — )• 
oo. In case (a) we take to be an upper bound to /Or and note that the 
superexponential condition (4.9) implies that 5 — )■ in probability. From the 
latter property it follows that 

(6.17) ^log|^\l-t)'^/2^G(t)| ^0. 

In case (b) we choose A to be a function of /i, decreasing to zero as 
0, in such a manner that /i^0~^A~^ — )■ and h'^0~/^^\^ — )• oo. The first of 
these properties ensures that (4.6) holds, and the second that 5 — )■ in 
probability, so that, once again, (6.17) obtains. Hence, in either case, results 
(4.8) and (4.7), and Stirling's formula, imply (4.14). More generally, case 
(a) can be extended to that where \\og{h?' /Or)\ < Cr, provided the sequence 
ci , C2 , . . . diverges sufficiently slowly. This is the constant sequence used in 
the definition of r(/i) in (4.11). 

To connect these results to the statement of (4.14) in Theorem 4.2, sup- 
pose that (4.14) fails in that context. Then we can find a sequence hi,h2, ■ ■ ■ , 
decreasing to zero, such that the term written as o(l) on the far right- 
hand side of (4.14) is actually bounded away from zero. Let s = s{h) denote 
the integer for which |log(/i^/^s)| is minimized. If, for all sufficiently large 
k, \log{hl/Os(h^:))\ < Cs(hk)i tlien r{hk) = s{hk) [by the definition of r{h) in 
(4.11)] and the result stated in the second-last sentence of the previous 
paragraph establishes (4.14). Hence, by passing to a sub-subsequence if nec- 
essary, we may assume that |log(/i^/05(/i^))| > Cs(/ij.) for all sufficiently large 
k, in which case [again using the definition of r{h)] Or[hk)+i < ^fc < ^r{hk) 
all large fc, and both h1/0r(^hk) ~^ ^k/^r{hk)+i ~^ However, it then 
follows from case (b) in the previous paragraph that (4.14) holds. Therefore 
(4.14) must hold in the context of Theorem 4.2. 

To prove (4.15), define r as at (4.12) and note that Or^h^ < A^ and 
(^r+i^'^ > '^^^ ^^s* these properties ensures (4.6), and so permits us to 
apply Theorem 4.1, and the second guarantees that, with Ci = supj>]^ 
and C2 denoting the upper bound to ^j>k+i^j ™ (4-10), we have 

E{S)<Cih"^ Oj<Ci\~^0;l^ ^j<CiA-2(i + C2). 

j>r+l i>'r+l 

Therefore, given ei > we can choose A so large that P{S > Si) < ei, and 
hence, given £2 > we can select A sufficiently large, but fixed, to ensure 
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that, for all sufficiently small h, 



(6.18) 



logy \l-tY^'dG{t) 



< e2r. 



Results (4.8), (4.7), (6.18) and Stirling's formula imply (4.14). 

APPENDIX: NONEXISTENCE OF PROBABILITY DENSITY 
FUNCTION FOR FUNCTIONAL DATA 

If X is a random vector of finite length then we generally define the 
probability density, /(x), of X at the point x, as the limit as h decreases 
to zero of the probability that X lies in the ball of radius h centered at x, 
divided by the Lebesgue measure of that ball. For example, in Euclidean 
space of dimension r, 

(A.l) fix) = \im{h'"vry^P{\\X - x\\ < h), 

where || • || denotes Euclidean distance in M'', and Vr represents the content of 
the r-dimensional unit sphere. It might be expected that a formula analogous 
to (A.l), with the divisor h^Vr replaced by a different function of h, would 
be appropriate for estimating the probability density of a random function 
X . However, in general it is not. 

To appreciate why, let X be a random function and x a fixed function, and 
note that if there were to exist a function, a{h) say, such that the probability 
density 

fix) = lim{a(/i)}-^P(||X - x|| < h) 

were well defined, then for all x we would have 

log fix) = lim[- log{a(/i)} + logP(||X - x\\ < h)] 

and thus logP(||X - x\\ < h) = Ci + log fix) + o(l) where Ci = log{a(/i)} 
does not depend on x, and fix) does not depend on h. However, (2.2) shows 
that this is not possible. 
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